Abstract

Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for
probabilistic topic modeling, which has attracted worldwide interest and touches on many
important applications in text mining, computer vision and computational biology. This
paper introduces a topic modeling toolbox (TMBP) based on belief propagation (BP)
algorithms. The TMBP toolbox is implemented in MEX C++/Matlab/Octave for either
Windows 7 or Linux. Compared with existing topic modeling packages, the novelty of this
toolbox lies in the BP algorithms for learning LDA-based topic models. The current version
includes BP algorithms for latent Dirichlet allocation (LDA), author-topic models (ATM),
relational topic models (RTM), and labeled LDA (LaLDA). This toolbox is an ongoing
project, and more BP-based algorithms for various topic models will be added in the near
future. Interested users may also extend the BP algorithms to learn more complicated topic
models. The source code is freely available under the GNU General Public Licence,
Version 1.0 at https://mloss.org/software/view/399/.

1. Introduction

The past decade has seen rapid development of latent Dirichlet allocation (LDA) (Blei et al.,
2003) for solving topic modeling problems because of its elegant three-layer graphical
representation as well as two efficient approximate inference methods, variational Bayes
(VB) (Blei et al., 2003) and collapsed Gibbs sampling (GS) (Griffiths and Steyvers, 2004).
Both VB and GS have been widely used to learn variants of LDA-based topic models. Our
recent work (Zeng et al., 2011) reveals that there is yet another learning algorithm for
LDA, based on loopy belief propagation (BP). The basic idea of BP is inspired by the
collapsed GS algorithm, in which the three-layer LDA can be interpreted as being collapsed
into a two-layer Markov random field (MRF) represented by a factor graph (Kschischang
et al., 2001). BP algorithms such as the sum-product algorithm operate well on factor
graphs (Bishop, 2006). Extensive experiments confirm that BP is faster and more accurate
than both VB and GS, and thus is a strong candidate for becoming the standard topic
modeling algorithm. For example, we show how to learn three typical variants of LDA-based
topic models, namely author-topic models (ATM) (Rosen-Zvi et al., 2004), relational topic
models (RTM) (Chang and Blei, 2010), and labeled LDA (LaLDA) (Ramage et al., 2009),
using BP based on the novel factor graph representations (Zeng et al., 2011).

We have implemented the topic modeling toolbox, called TMBP, in MEX C++ with a
Matlab/Octave interface, covering the VB, GS and BP algorithms. Compared with other topic
modeling packages (footnotes 1-6), the novelty of this toolbox lies in the BP algorithms for
topic modeling. This paper introduces how to use this toolbox for basic topic modeling tasks.



2. Belief Propagation for Topic Modeling 

Given a document-word matrix $\mathbf{x} = \{x_{w,d}\}$ ($x_{w,d}$ is the number of word counts at the
index $\{w, d\}$) with word indices $1 \le w \le W$ in the vocabulary and document indices
$1 \le d \le D$ in the corpus, the probabilistic topic modeling task is to allocate topic labels
$\mathbf{z} = \{z^k_{w,d}\}$, $z^k_{w,d} \in \{0, 1\}$, $\sum_{k=1}^{K} z^k_{w,d} = 1$, $1 \le k \le K$, to partition the nonzero elements
$x_{w,d} \neq 0$ into $K$ topics (provided by the user) according to three topic modeling rules:



1. Co-occurrence: different word indices $w$ in the same document $d$ tend to have the
same topic label.

2. Smoothness: the same word index $w$ in different documents $d$ tends to have the
same topic label.

3. Clustering: the word indices $w$ do not all tend to be associated with the same topic label.
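
As a concrete illustration of this setup, the following minimal Python sketch stores the nonzero counts $x_{w,d}$ of a toy corpus sparsely and attaches to each nonzero entry a length-$K$ vector that sums to one, which is a soft version of the constraint $\sum_k z^k_{w,d} = 1$. The toy counts, the helper name `init_messages`, and the random initialization are assumptions made for illustration; they are not taken from the toolbox.

```python
import random

def init_messages(x, K, seed=0):
    """Attach a normalized length-K vector to every nonzero count x[(w, d)]."""
    rng = random.Random(seed)
    mu = {}
    for (w, d), count in x.items():
        if count <= 0:
            continue  # only nonzero elements are partitioned into topics
        raw = [rng.random() + 1e-12 for _ in range(K)]  # strictly positive
        total = sum(raw)
        mu[(w, d)] = [r / total for r in raw]
    return mu

# Toy document-word matrix: keys are (word index w, document index d).
x = {(0, 0): 2, (1, 0): 1, (1, 1): 3, (2, 1): 1}
K = 2  # number of topics, provided by the user
mu = init_messages(x, K)
```

Storing only the nonzero entries keeps the cost proportional to the number of distinct (word, document) pairs rather than to $W \times D$.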

Based on the above rules, recent approximate inference methods compute the marginal
distribution of the topic label, $\mu_{w,d}(k) = p(z^k_{w,d} = 1)$, called the message, and estimate
parameters using the iterative EM algorithm (Dempster et al., 1977) according to the
maximum-likelihood criterion. The major difference among these inference methods lies in
the message update equation. VB updates messages through complicated digamma functions,
which cause bias and slow down message updating (Zeng et al., 2011). GS updates messages
by topic labels randomly sampled from the message in the previous iteration, and the
sampling process does not keep all the uncertainty encoded in the previous message. In
contrast, BP directly uses the previous message to update the current message without
sampling. Similar ideas have also been proposed within the approximate mean-field
framework (Asuncion, 2010) as the zero-order approximation of the collapsed VB (CVB0)
algorithm (Asuncion et al., 2009). While proper settings of hyperparameters can make the
topic modeling performance comparable among different inference methods (Asuncion et al.,
2009), we still advocate the BP algorithms because of their ease of use and fast speed.
Table 1 compares the message update equations of VB, GS and BP. Compared with BP, VB
uses the digamma function $\Psi$ in the message update, and GS uses the discrete count of
sampled topic labels $n^{-i}_{w,d}$, based on word tokens rather than word indices, in the message
update. The Dirichlet hyperparameters $\alpha$ and $\beta$ can be viewed as pseudo-messages. The
notations $-w$ and $-d$ denote all word indices except $w$ and all document indices except
$d$, and $-i$ denotes all word tokens except the current word token $i$. More details can be
found in our work (Zeng et al., 2011, 2012a).
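
To make the BP update concrete, here is a minimal Python sketch of one possible synchronous schedule. It assumes the update has the form $\mu_{w,d}(k) \propto [\mu_{-w,d}(k) + \alpha] \times [\mu_{w,-d}(k) + \beta] / [\sum_w \mu_{w,-d}(k) + W\beta]$, where $\mu_{-w,d}(k)$ and $\mu_{w,-d}(k)$ are count-weighted sums of messages over the other words of document $d$ and over the other documents containing word $w$. The function name `sbp_lda`, the sparse dictionary input, and the fixed iteration count are hypothetical choices for this sketch and do not describe the toolbox's MEX implementation.

```python
import random

def sbp_lda(x, W, D, K, alpha, beta, n_iter=50, seed=0):
    """Synchronous BP sketch for LDA on a sparse count dict x[(w, d)]."""
    rng = random.Random(seed)
    # Random normalized initial messages, one length-K vector per nonzero count.
    mu = {}
    for key in x:
        raw = [rng.random() + 1e-12 for _ in range(K)]
        s = sum(raw)
        mu[key] = [r / s for r in raw]

    for _ in range(n_iter):
        # Count-weighted message totals per document, per word, and overall.
        doc = [[0.0] * K for _ in range(D)]
        word = [[0.0] * K for _ in range(W)]
        tot = [0.0] * K
        for (w, d), c in x.items():
            for k in range(K):
                m = c * mu[(w, d)][k]
                doc[d][k] += m
                word[w][k] += m
                tot[k] += m

        new_mu = {}
        for (w, d), c in x.items():
            out = []
            for k in range(K):
                own = c * mu[(w, d)][k]
                mu_doc = doc[d][k] - own      # mu_{-w,d}(k): other words in d
                mu_word = word[w][k] - own    # mu_{w,-d}(k): other docs with w
                # Assumption: sum over w of mu_{w,-d}(k) excludes all of document d.
                mu_all = tot[k] - doc[d][k]
                out.append((mu_doc + alpha) * (mu_word + beta)
                           / (mu_all + W * beta))
            s = sum(out)
            new_mu[(w, d)] = [v / s for v in out]
        mu = new_mu  # synchronous: every message is updated from the old ones
    return mu

# Toy corpus: 4 words, 2 documents, keys are (word index, document index).
x = {(0, 0): 2, (1, 0): 1, (2, 1): 3, (3, 1): 1, (0, 1): 1}
mu = sbp_lda(x, W=4, D=2, K=2, alpha=0.1, beta=0.01)
```

The factor $\sum_k[\mu_{-w,d}(k) + \alpha]$ is omitted because it does not depend on $k$ and cancels when the message is normalized.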



1 http://www.cs.princeton.edu/~blei/lda-c/index.html
2 http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
3 http://nlp.stanford.edu/software/tmt/tmt-0.3/
4 http://CRAN.R-project.org/package=lda
5 http://mallet.cs.umass.edu/
6 http://www.arbylon.net/projects/
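Table 1: Comparison of message update equations (Zeng et al., 2011).

Inference method    Message update equation
VB                  $\mu_{w,d}(k) \propto \frac{\exp[\Psi(\mu_{\cdot,d}(k) + \alpha)]}{\exp[\Psi(\sum_k[\mu_{\cdot,d}(k) + \alpha])]} \times \frac{\mu_{w,\cdot}(k) + \beta}{\sum_w[\mu_{w,\cdot}(k) + \beta]}$
GS                  $\mu_{w,d,i}(k) \propto \frac{n^{-i}_{\cdot,d}(k) + \alpha}{\sum_k[n^{-i}_{\cdot,d}(k) + \alpha]} \times \frac{n^{-i}_{w,\cdot}(k) + \beta}{\sum_w[n^{-i}_{w,\cdot}(k) + \beta]}$
BP                  $\mu_{w,d}(k) \propto \frac{\mu_{-w,d}(k) + \alpha}{\sum_k[\mu_{-w,d}(k) + \alpha]} \times \frac{\mu_{w,-d}(k) + \beta}{\sum_w[\mu_{w,-d}(k) + \beta]}$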



Because VB and GS have been widely used for learning different LDA-based topic models,
it is easy to develop the corresponding BP algorithms for these models, either by removing
the digamma function from the VB algorithm or by not sampling from the posterior
probability in the GS algorithm. For example, we show how to develop the corresponding
BP algorithms for two typical LDA-based topic models, ATM and RTM (Zeng et al., 2011).
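
The two conversions described above can be illustrated with a small Python sketch. It contrasts a VB-style factor $\exp[\Psi(s + \text{prior})]$ with the plain BP-style factor $s + \text{prior}$, and a GS-style step that keeps only one sampled label with a BP-style step that keeps the whole message. All function names are hypothetical, and the digamma routine is a generic recurrence-plus-asymptotic-series approximation written for this sketch, not code from the toolbox.

```python
import math
import random

def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def vb_factor(stat, prior):
    """VB-style factor: the statistic passes through exp(digamma(.))."""
    return math.exp(digamma(stat + prior))

def bp_factor(stat, prior):
    """BP-style factor: the same statistic without the digamma function."""
    return stat + prior

def gs_step(message, rng):
    """GS-style step: keep one sampled label as a one-hot vector."""
    k = rng.choices(range(len(message)), weights=message)[0]
    return [1.0 if j == k else 0.0 for j in range(len(message))]

def bp_step(message):
    """BP-style step: pass the full message on, keeping its uncertainty."""
    return list(message)

rng = random.Random(0)
message = [0.7, 0.2, 0.1]
sampled = gs_step(message, rng)   # one-hot, varies with the random draw
kept = bp_step(message)           # identical to the input message
gap_large = bp_factor(10.0, 0.1) - vb_factor(10.0, 0.1)
gap_small = bp_factor(0.0, 0.1) - vb_factor(0.0, 0.1)
```

For large arguments $\exp[\Psi(x)]$ is close to $x - 0.5$, so the two factors differ by roughly a constant; for small arguments the VB-style factor shrinks toward zero much faster than the plain sum.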



3. An Example of Using TMBP 



The TMBP toolbox contains source code for learning LDA based on VB, GS, and BP (Zeng et al.,
2011, 2012a,b), learning author-topic models (ATM) (Rosen-Zvi et al., 2004) based on
GS and BP, and learning relational topic models (RTM) (Chang and Blei, 2010) and labeled
LDA (Ramage et al., 2009) using BP. Implementation details can be found in "readme".

Here, we present a demo for the synchronous BP (sBP) algorithm. After installation, we run
demo1.m in the Octave/Matlab environment. The results (the training perplexity at every
10 iterations and the top five words in each of ten topics) are printed on the screen:



The sBP Algorithm
Iteration 10 of 500: 1041.620873

Iteration 490 of 500: 741.946849
Elapsed time is 13.246747 seconds.

Top five words in each of ten topics by sBP
design system reasoning case knowledge
model models bayesian data markov
genetic problem search algorithms programming
algorithm learning number function model
learning paper theory knowledge examples
learning control reinforcement paper state
model visual recognition system patterns
research report technical grant university
network neural networks learning input
data decision training algorithm classification
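
The two quantities printed by the demo can be sketched in Python as follows. The sketch assumes the usual definition of training perplexity, $\exp(-\sum_{w,d} x_{w,d} \log p(w|d) / \sum_{w,d} x_{w,d})$ with $p(w|d) = \sum_k \theta_d(k)\,\phi_k(w)$, and takes the document-topic matrix `theta` and the topic-word matrix `phi` as given. The function names, the toy matrices, and the toy vocabulary are hypothetical; the toolbox may compute these quantities differently.

```python
import math

def training_perplexity(x, theta, phi):
    """exp(-sum x*log p(w|d) / sum x), p(w|d) = sum_k theta[d][k]*phi[k][w]."""
    log_lik = 0.0
    n_tokens = 0
    for (w, d), c in x.items():
        p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
        log_lik += c * math.log(p)
        n_tokens += c
    return math.exp(-log_lik / n_tokens)

def top_words(phi, vocab, n=5):
    """Return the n highest-probability words of each topic."""
    tops = []
    for topic in phi:
        order = sorted(range(len(topic)), key=lambda w: topic[w], reverse=True)
        tops.append([vocab[w] for w in order[:n]])
    return tops

# Toy example: 2 topics, 4 words, 2 documents (all values are made up).
vocab = ["model", "data", "network", "learning"]
phi = [[0.5, 0.3, 0.1, 0.1],   # topic 0 over the vocabulary
       [0.1, 0.1, 0.4, 0.4]]   # topic 1 over the vocabulary
theta = [[0.9, 0.1],           # document 0 over topics
         [0.2, 0.8]]           # document 1 over topics
x = {(0, 0): 2, (1, 0): 1, (2, 1): 3, (3, 1): 1}
ppl = training_perplexity(x, theta, phi)
tops = top_words(phi, vocab, n=2)
```

A lower training perplexity means the fitted model assigns higher probability to the observed word counts, which matches the decrease between iterations 10 and 490 in the output above.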